{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h1> Machine Learning using tf.estimator </h1>\n",
    "\n",
    "In this notebook, we will create a machine learning model using tf.estimator and evaluate its performance.  The dataset is rather small (7700 samples), so we can do it all in-memory.  We will also simply pass the raw data in as-is. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/usr/local/envs/py3env/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n",
      "  from ._conv import register_converters as _register_converters\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.8.0\n"
     ]
    }
   ],
   "source": [
    "import datalab.bigquery as bq\n",
    "import tensorflow as tf\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import shutil\n",
    "\n",
    "print(tf.__version__)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Read data created in the previous chapter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# In CSV, label is the first column, after the features, followed by the key\n",
    "CSV_COLUMNS = ['fare_amount', 'pickuplon','pickuplat','dropofflon','dropofflat','passengers', 'key']\n",
    "FEATURES = CSV_COLUMNS[1:len(CSV_COLUMNS) - 1]\n",
    "LABEL = CSV_COLUMNS[0]\n",
    "\n",
    "df_train = pd.read_csv('./taxi-train.csv', header = None, names = CSV_COLUMNS)\n",
    "df_valid = pd.read_csv('./taxi-valid.csv', header = None, names = CSV_COLUMNS)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2> Input function to read from Pandas Dataframe into tf.constant </h2>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "def make_input_fn(df, num_epochs):\n",
    "  return tf.estimator.inputs.pandas_input_fn(\n",
    "    x = df,\n",
    "    y = df[LABEL],\n",
    "    batch_size = 128,\n",
    "    num_epochs = num_epochs,\n",
    "    shuffle = True,\n",
    "    queue_capacity = 1000,\n",
    "    num_threads = 1\n",
    "  )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create feature columns for estimator"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "def make_feature_cols():\n",
    "  input_columns = [tf.feature_column.numeric_column(k) for k in FEATURES]\n",
    "  return input_columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3> Linear Regression with tf.Estimator framework </h3>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO:tensorflow:Using default config.\n",
      "INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_num_worker_replicas': 1, '_is_chief': True, '_log_step_count_steps': 100, '_keep_checkpoint_max': 5, '_model_dir': 'taxi_trained', '_evaluation_master': '', '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_master': '', '_train_distribute': None, '_tf_random_seed': None, '_task_id': 0, '_num_ps_replicas': 0, '_save_summary_steps': 100, '_task_type': 'worker', '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f2a0bc86898>, '_session_config': None, '_save_checkpoints_steps': None, '_global_id_in_cluster': 0}\n",
      "INFO:tensorflow:Calling model_fn.\n",
      "INFO:tensorflow:Done calling model_fn.\n",
      "INFO:tensorflow:Create CheckpointSaverHook.\n",
      "INFO:tensorflow:Graph was finalized.\n",
      "INFO:tensorflow:Running local_init_op.\n",
      "INFO:tensorflow:Done running local_init_op.\n",
      "INFO:tensorflow:Saving checkpoints for 1 into taxi_trained/model.ckpt.\n",
      "INFO:tensorflow:loss = 27440.74, step = 1\n",
      "INFO:tensorflow:global_step/sec: 250.95\n",
      "INFO:tensorflow:loss = 15465.457, step = 101 (0.402 sec)\n",
      "INFO:tensorflow:global_step/sec: 343.634\n",
      "INFO:tensorflow:loss = 7376.7725, step = 201 (0.291 sec)\n",
      "INFO:tensorflow:global_step/sec: 343.244\n",
      "INFO:tensorflow:loss = 10833.051, step = 301 (0.292 sec)\n",
      "INFO:tensorflow:global_step/sec: 328.252\n",
      "INFO:tensorflow:loss = 10107.798, step = 401 (0.304 sec)\n",
      "INFO:tensorflow:global_step/sec: 307.038\n",
      "INFO:tensorflow:loss = 9387.074, step = 501 (0.326 sec)\n",
      "INFO:tensorflow:Saving checkpoints for 573 into taxi_trained/model.ckpt.\n",
      "INFO:tensorflow:Loss for final step: 5084.834.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<tensorflow.python.estimator.canned.linear.LinearRegressor at 0x7f2a0bc86710>"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tf.logging.set_verbosity(tf.logging.INFO)\n",
    "\n",
    "OUTDIR = 'taxi_trained'\n",
    "shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time\n",
    "\n",
    "model = tf.estimator.LinearRegressor(\n",
    "      feature_columns = make_feature_cols(), model_dir = OUTDIR)\n",
    "\n",
    "model.train(input_fn = make_input_fn(df_train, num_epochs = 10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Evaluate on the validation data (we should defer using the test data to after we have selected a final model)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO:tensorflow:Calling model_fn.\n",
      "INFO:tensorflow:Done calling model_fn.\n",
      "INFO:tensorflow:Starting evaluation at 2019-01-07-05:57:08\n",
      "INFO:tensorflow:Graph was finalized.\n",
      "INFO:tensorflow:Restoring parameters from taxi_trained/model.ckpt-573\n",
      "INFO:tensorflow:Running local_init_op.\n",
      "INFO:tensorflow:Done running local_init_op.\n",
      "INFO:tensorflow:Finished evaluation at 2019-01-07-05:57:09\n",
      "INFO:tensorflow:Saving dict for global step 573: average_loss = 95.19483, global_step = 573, loss = 11503.929\n",
      "RMSE on validation dataset = 9.756783485412598\n"
     ]
    }
   ],
   "source": [
    "def print_rmse(model, name, df):\n",
    "  metrics = model.evaluate(input_fn = make_input_fn(df, 1))\n",
    "  print('RMSE on {} dataset = {}'.format(name, np.sqrt(metrics['average_loss'])))\n",
    "print_rmse(model, 'validation', df_valid)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is nowhere near our benchmark (RMSE of $6 or so on this data), but it serves to demonstrate what TensorFlow code looks like.  Let's use this model for prediction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO:tensorflow:Using default config.\n",
      "INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_num_worker_replicas': 1, '_is_chief': True, '_log_step_count_steps': 100, '_keep_checkpoint_max': 5, '_model_dir': 'taxi_trained', '_evaluation_master': '', '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_master': '', '_train_distribute': None, '_tf_random_seed': None, '_task_id': 0, '_num_ps_replicas': 0, '_save_summary_steps': 100, '_task_type': 'worker', '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f2a0a69d5f8>, '_session_config': None, '_save_checkpoints_steps': None, '_global_id_in_cluster': 0}\n",
      "INFO:tensorflow:Calling model_fn.\n",
      "INFO:tensorflow:Done calling model_fn.\n",
      "INFO:tensorflow:Graph was finalized.\n",
      "INFO:tensorflow:Restoring parameters from taxi_trained/model.ckpt-573\n",
      "INFO:tensorflow:Running local_init_op.\n",
      "INFO:tensorflow:Done running local_init_op.\n",
      "[10.404693, 10.401687, 10.526818, 10.402787, 10.404966]\n"
     ]
    }
   ],
   "source": [
    "import itertools\n",
    "# Read saved model and use it for prediction\n",
    "model = tf.estimator.LinearRegressor(\n",
    "      feature_columns = make_feature_cols(), model_dir = OUTDIR)\n",
    "preds_iter = model.predict(input_fn = make_input_fn(df_valid, 1))\n",
    "print([pred['predictions'][0] for pred in list(itertools.islice(preds_iter, 5))])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This explains why the RMSE was so high -- the model essentially predicts the same amount for every trip.  Would a more complex model help? Let's try using a deep neural network.  The code to do this is quite straightforward as well."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3> Deep Neural Network regression </h3>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO:tensorflow:Using default config.\n",
      "INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_num_worker_replicas': 1, '_is_chief': True, '_log_step_count_steps': 100, '_keep_checkpoint_max': 5, '_model_dir': 'taxi_trained', '_evaluation_master': '', '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_master': '', '_train_distribute': None, '_tf_random_seed': None, '_task_id': 0, '_num_ps_replicas': 0, '_save_summary_steps': 100, '_task_type': 'worker', '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f2a0934e2e8>, '_session_config': None, '_save_checkpoints_steps': None, '_global_id_in_cluster': 0}\n",
      "INFO:tensorflow:Calling model_fn.\n",
      "INFO:tensorflow:Done calling model_fn.\n",
      "INFO:tensorflow:Create CheckpointSaverHook.\n",
      "INFO:tensorflow:Graph was finalized.\n",
      "INFO:tensorflow:Running local_init_op.\n",
      "INFO:tensorflow:Done running local_init_op.\n",
      "INFO:tensorflow:Saving checkpoints for 1 into taxi_trained/model.ckpt.\n",
      "INFO:tensorflow:loss = 28607.73, step = 1\n",
      "INFO:tensorflow:global_step/sec: 245.973\n",
      "INFO:tensorflow:loss = 17859.195, step = 101 (0.413 sec)\n",
      "INFO:tensorflow:global_step/sec: 285.521\n",
      "INFO:tensorflow:loss = 28225.68, step = 201 (0.348 sec)\n",
      "INFO:tensorflow:global_step/sec: 256.532\n",
      "INFO:tensorflow:loss = 26882.078, step = 301 (0.390 sec)\n",
      "INFO:tensorflow:global_step/sec: 280.459\n",
      "INFO:tensorflow:loss = 18754.225, step = 401 (0.357 sec)\n",
      "INFO:tensorflow:global_step/sec: 261.342\n",
      "INFO:tensorflow:loss = 21997.562, step = 501 (0.383 sec)\n",
      "INFO:tensorflow:global_step/sec: 298.876\n",
      "INFO:tensorflow:loss = 25782.695, step = 601 (0.334 sec)\n",
      "INFO:tensorflow:global_step/sec: 311.085\n",
      "INFO:tensorflow:loss = 18242.92, step = 701 (0.321 sec)\n",
      "INFO:tensorflow:global_step/sec: 268.513\n",
      "INFO:tensorflow:loss = 50172.37, step = 801 (0.373 sec)\n",
      "INFO:tensorflow:global_step/sec: 293.367\n",
      "INFO:tensorflow:loss = 15899.174, step = 901 (0.341 sec)\n",
      "INFO:tensorflow:global_step/sec: 296.798\n",
      "INFO:tensorflow:loss = 15931.12, step = 1001 (0.337 sec)\n",
      "INFO:tensorflow:global_step/sec: 263.783\n",
      "INFO:tensorflow:loss = 29358.307, step = 1101 (0.379 sec)\n",
      "INFO:tensorflow:global_step/sec: 295.687\n",
      "INFO:tensorflow:loss = 16974.207, step = 1201 (0.338 sec)\n",
      "INFO:tensorflow:global_step/sec: 304.375\n",
      "INFO:tensorflow:loss = 14427.602, step = 1301 (0.329 sec)\n",
      "INFO:tensorflow:global_step/sec: 264.946\n",
      "INFO:tensorflow:loss = 21096.465, step = 1401 (0.377 sec)\n",
      "INFO:tensorflow:global_step/sec: 308.059\n",
      "INFO:tensorflow:loss = 31343.576, step = 1501 (0.325 sec)\n",
      "INFO:tensorflow:global_step/sec: 312.508\n",
      "INFO:tensorflow:loss = 12164.592, step = 1601 (0.320 sec)\n",
      "INFO:tensorflow:global_step/sec: 231.96\n",
      "INFO:tensorflow:loss = 19954.412, step = 1701 (0.431 sec)\n",
      "INFO:tensorflow:global_step/sec: 307.545\n",
      "INFO:tensorflow:loss = 14334.197, step = 1801 (0.325 sec)\n",
      "INFO:tensorflow:global_step/sec: 309.875\n",
      "INFO:tensorflow:loss = 20061.816, step = 1901 (0.323 sec)\n",
      "INFO:tensorflow:global_step/sec: 272.217\n",
      "INFO:tensorflow:loss = 11962.799, step = 2001 (0.367 sec)\n",
      "INFO:tensorflow:global_step/sec: 286.742\n",
      "INFO:tensorflow:loss = 21175.979, step = 2101 (0.349 sec)\n",
      "INFO:tensorflow:global_step/sec: 274.704\n",
      "INFO:tensorflow:loss = 10345.787, step = 2201 (0.364 sec)\n",
      "INFO:tensorflow:global_step/sec: 285.213\n",
      "INFO:tensorflow:loss = 51596.992, step = 2301 (0.350 sec)\n",
      "INFO:tensorflow:global_step/sec: 296.059\n",
      "INFO:tensorflow:loss = 18981.705, step = 2401 (0.338 sec)\n",
      "INFO:tensorflow:global_step/sec: 292.213\n",
      "INFO:tensorflow:loss = 22762.723, step = 2501 (0.343 sec)\n",
      "INFO:tensorflow:global_step/sec: 287.597\n",
      "INFO:tensorflow:loss = 10301.49, step = 2601 (0.347 sec)\n",
      "INFO:tensorflow:global_step/sec: 307.725\n",
      "INFO:tensorflow:loss = 23616.488, step = 2701 (0.325 sec)\n",
      "INFO:tensorflow:global_step/sec: 272.234\n",
      "INFO:tensorflow:loss = 12964.539, step = 2801 (0.367 sec)\n",
      "INFO:tensorflow:global_step/sec: 287.607\n",
      "INFO:tensorflow:loss = 49494.277, step = 2901 (0.348 sec)\n",
      "INFO:tensorflow:global_step/sec: 308.896\n",
      "INFO:tensorflow:loss = 17261.879, step = 3001 (0.324 sec)\n",
      "INFO:tensorflow:global_step/sec: 274.466\n",
      "INFO:tensorflow:loss = 18735.988, step = 3101 (0.365 sec)\n",
      "INFO:tensorflow:global_step/sec: 300.085\n",
      "INFO:tensorflow:loss = 17590.49, step = 3201 (0.333 sec)\n",
      "INFO:tensorflow:global_step/sec: 319.801\n",
      "INFO:tensorflow:loss = 27418.559, step = 3301 (0.312 sec)\n",
      "INFO:tensorflow:global_step/sec: 295.317\n",
      "INFO:tensorflow:loss = 13809.393, step = 3401 (0.339 sec)\n",
      "INFO:tensorflow:global_step/sec: 300.354\n",
      "INFO:tensorflow:loss = 9743.24, step = 3501 (0.333 sec)\n",
      "INFO:tensorflow:global_step/sec: 319.135\n",
      "INFO:tensorflow:loss = 19238.74, step = 3601 (0.313 sec)\n",
      "INFO:tensorflow:global_step/sec: 287.048\n",
      "INFO:tensorflow:loss = 19036.22, step = 3701 (0.348 sec)\n",
      "INFO:tensorflow:global_step/sec: 302.753\n",
      "INFO:tensorflow:loss = 6894.4805, step = 3801 (0.330 sec)\n",
      "INFO:tensorflow:global_step/sec: 288.063\n",
      "INFO:tensorflow:loss = 16138.993, step = 3901 (0.347 sec)\n",
      "INFO:tensorflow:global_step/sec: 285.423\n",
      "INFO:tensorflow:loss = 11986.52, step = 4001 (0.352 sec)\n",
      "INFO:tensorflow:global_step/sec: 298.222\n",
      "INFO:tensorflow:loss = 10430.516, step = 4101 (0.334 sec)\n",
      "INFO:tensorflow:global_step/sec: 310.681\n",
      "INFO:tensorflow:loss = 14918.604, step = 4201 (0.322 sec)\n",
      "INFO:tensorflow:global_step/sec: 293.627\n",
      "INFO:tensorflow:loss = 17170.426, step = 4301 (0.341 sec)\n",
      "INFO:tensorflow:global_step/sec: 298.63\n",
      "INFO:tensorflow:loss = 17361.8, step = 4401 (0.335 sec)\n",
      "INFO:tensorflow:global_step/sec: 319.422\n",
      "INFO:tensorflow:loss = 16511.75, step = 4501 (0.313 sec)\n",
      "INFO:tensorflow:global_step/sec: 277.057\n",
      "INFO:tensorflow:loss = 19822.902, step = 4601 (0.361 sec)\n",
      "INFO:tensorflow:global_step/sec: 302.85\n",
      "INFO:tensorflow:loss = 23425.953, step = 4701 (0.330 sec)\n",
      "INFO:tensorflow:global_step/sec: 309.169\n",
      "INFO:tensorflow:loss = 17351.13, step = 4801 (0.324 sec)\n",
      "INFO:tensorflow:global_step/sec: 279.558\n",
      "INFO:tensorflow:loss = 25098.066, step = 4901 (0.358 sec)\n",
      "INFO:tensorflow:global_step/sec: 301.065\n",
      "INFO:tensorflow:loss = 6051.5635, step = 5001 (0.332 sec)\n",
      "INFO:tensorflow:global_step/sec: 301.22\n",
      "INFO:tensorflow:loss = 12585.726, step = 5101 (0.332 sec)\n",
      "INFO:tensorflow:global_step/sec: 279.491\n",
      "INFO:tensorflow:loss = 10462.142, step = 5201 (0.358 sec)\n",
      "INFO:tensorflow:global_step/sec: 275.975\n",
      "INFO:tensorflow:loss = 20543.004, step = 5301 (0.363 sec)\n",
      "INFO:tensorflow:global_step/sec: 296.939\n",
      "INFO:tensorflow:loss = 33279.633, step = 5401 (0.336 sec)\n",
      "INFO:tensorflow:global_step/sec: 273.416\n",
      "INFO:tensorflow:loss = 17146.979, step = 5501 (0.369 sec)\n",
      "INFO:tensorflow:global_step/sec: 302.599\n",
      "INFO:tensorflow:loss = 15220.414, step = 5601 (0.327 sec)\n",
      "INFO:tensorflow:global_step/sec: 291.791\n",
      "INFO:tensorflow:loss = 11970.334, step = 5701 (0.343 sec)\n",
      "INFO:tensorflow:Saving checkpoints for 5729 into taxi_trained/model.ckpt.\n",
      "INFO:tensorflow:Loss for final step: 9290.794.\n",
      "INFO:tensorflow:Calling model_fn.\n",
      "INFO:tensorflow:Done calling model_fn.\n",
      "INFO:tensorflow:Starting evaluation at 2019-01-07-05:58:34\n",
      "INFO:tensorflow:Graph was finalized.\n",
      "INFO:tensorflow:Restoring parameters from taxi_trained/model.ckpt-5729\n",
      "INFO:tensorflow:Running local_init_op.\n",
      "INFO:tensorflow:Done running local_init_op.\n",
      "INFO:tensorflow:Finished evaluation at 2019-01-07-05:58:35\n",
      "INFO:tensorflow:Saving dict for global step 5729: average_loss = 118.552666, global_step = 5729, loss = 14326.634\n",
      "RMSE on validation dataset = 10.888189315795898\n"
     ]
    }
   ],
   "source": [
    "tf.logging.set_verbosity(tf.logging.INFO)\n",
    "shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time\n",
    "model = tf.estimator.DNNRegressor(hidden_units = [32, 8, 2],\n",
    "      feature_columns = make_feature_cols(), model_dir = OUTDIR)\n",
    "model.train(input_fn = make_input_fn(df_train, num_epochs = 100));\n",
    "print_rmse(model, 'validation', df_valid)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are not beating our benchmark with either model ... what's up?  Well, we may be using TensorFlow for Machine Learning, but we are not yet using it well.  That's what the rest of this course is about!\n",
    "\n",
    "But, for the record, let's say we had to choose between the two models. We'd choose the one with the lower validation error. Finally, we'd measure the RMSE on the test data with this chosen model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2> Benchmark dataset </h2>\n",
    "\n",
    "Let's do this on the benchmark dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "import datalab.bigquery as bq\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "def create_query(phase, EVERY_N):\n",
    "  \"\"\"\n",
    "  phase: 1 = train 2 = valid\n",
    "  \"\"\"\n",
    "  base_query = \"\"\"\n",
    "SELECT\n",
    "  (tolls_amount + fare_amount) AS fare_amount,\n",
    "  CONCAT(STRING(pickup_datetime), STRING(pickup_longitude), STRING(pickup_latitude), STRING(dropoff_latitude), STRING(dropoff_longitude)) AS key,\n",
    "  DAYOFWEEK(pickup_datetime)*1.0 AS dayofweek,\n",
    "  HOUR(pickup_datetime)*1.0 AS hourofday,\n",
    "  pickup_longitude AS pickuplon,\n",
    "  pickup_latitude AS pickuplat,\n",
    "  dropoff_longitude AS dropofflon,\n",
    "  dropoff_latitude AS dropofflat,\n",
    "  passenger_count*1.0 AS passengers,\n",
    "FROM\n",
    "  [nyc-tlc:yellow.trips]\n",
    "WHERE\n",
    "  trip_distance > 0\n",
    "  AND fare_amount >= 2.5\n",
    "  AND pickup_longitude > -78\n",
    "  AND pickup_longitude < -70\n",
    "  AND dropoff_longitude > -78\n",
    "  AND dropoff_longitude < -70\n",
    "  AND pickup_latitude > 37\n",
    "  AND pickup_latitude < 45\n",
    "  AND dropoff_latitude > 37\n",
    "  AND dropoff_latitude < 45\n",
    "  AND passenger_count > 0\n",
    "  \"\"\"\n",
    "\n",
    "  if EVERY_N == None:\n",
    "    if phase < 2:\n",
    "      # Training\n",
    "      query = \"{0} AND ABS(HASH(pickup_datetime)) % 4 < 2\".format(base_query)\n",
    "    else:\n",
    "      # Validation\n",
    "      query = \"{0} AND ABS(HASH(pickup_datetime)) % 4 == {1}\".format(base_query, phase)\n",
    "  else:\n",
    "    query = \"{0} AND ABS(HASH(pickup_datetime)) % {1} == {2}\".format(base_query, EVERY_N, phase)\n",
    "    \n",
    "  return query\n",
    "\n",
    "query = create_query(2, 100000)\n",
    "df = bq.Query(query).to_dataframe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO:tensorflow:Calling model_fn.\n",
      "INFO:tensorflow:Done calling model_fn.\n",
      "INFO:tensorflow:Starting evaluation at 2019-01-07-05:59:43\n",
      "INFO:tensorflow:Graph was finalized.\n",
      "INFO:tensorflow:Restoring parameters from taxi_trained/model.ckpt-5729\n",
      "INFO:tensorflow:Running local_init_op.\n",
      "INFO:tensorflow:Done running local_init_op.\n",
      "INFO:tensorflow:Finished evaluation at 2019-01-07-05:59:44\n",
      "INFO:tensorflow:Saving dict for global step 5729: average_loss = 112.467316, global_step = 5729, loss = 14294.885\n",
      "RMSE on benchmark dataset = 10.605060577392578\n"
     ]
    }
   ],
   "source": [
    "print_rmse(model, 'benchmark', df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "RMSE on benchmark dataset is <b>9.41</b> (your results will vary because of random seeds).\n",
    "\n",
    "This is not only way more than our original benchmark of 6.00, but it doesn't even beat our distance-based rule's RMSE of 8.02.\n",
    "\n",
    "Fear not -- you have learned how to write a TensorFlow model, but not to do all the things that you will have to do to your ML model performant. We will do this in the next chapters. In this chapter though, we will get our TensorFlow model ready for these improvements.\n",
    "\n",
    "In a software sense, the rest of the labs in this chapter will be about refactoring the code so that we can improve it."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright 2017 Google Inc. Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
